Goto

Collaborating Authors

 perception performance



Where2comm: Communication-Efficient Collaborative Perception via Spatial Confidence Maps

Neural Information Processing Systems

Multi-agent collaborative perception could significantly upgrade the perception performance by enabling agents to share complementary information with each other through communication. It inevitably results in a fundamental trade-off between perception performance and communication bandwidth. To tackle this bottleneck issue, we propose a spatial confidence map, which reflects the spatial heterogeneity of perceptual information. It empowers agents to only share spatially sparse, yet perceptually critical information, contributing to where to communicate. Based on this novel spatial confidence map, we propose Where2comm, a communication-efficient collaborative perception framework. Where2comm has two distinct advantages: i) it considers pragmatic compression and uses less communication to achieve higher perception performance by focusing on perceptually critical areas; and ii) it can handle varying communication bandwidth by dynamically adjusting spatial areas involved in communication. To evaluate Where2comm, we consider 3D object detection in both real-world and simulation scenarios with two modalities (camera/LiDAR) and two agent types (cars/drones) on four datasets: OPV2V, V2X-Sim, DAIR-V2X, and our original CoPerception-UAVs. Where2comm consistently outperforms previous methods; for example, it achieves more than $100,000 \times$ lower communication volume and still outperforms DiscoNet and V2X-ViT on OPV2V.


Where2comm: Communication-Efficient

Neural Information Processing Systems

Multi-agent collaborative perception could significantly upgrade the perception performance by enabling agents to share complementary information with each other through communication. It inevitably results in a fundamental trade-off between perception performance and communication bandwidth.


When Autonomous Vehicle Meets V2X Cooperative Perception: How Far Are We?

Guo, An, Zhang, Shuoxiao, Tang, Enyi, Gao, Xinyu, Pang, Haomin, Tian, Haoxiang, Mu, Yanzhou, Wen, Wu, Fang, Chunrong, Chen, Zhenyu

arXiv.org Artificial Intelligence

With the tremendous advancement of deep learning and communication technology, Vehicle-to-Everything (V2X) cooperative perception has the potential to address limitations in sensing distant objects and occlusion for a single-agent perception system. V2X cooperative perception systems are software systems characterized by diverse sensor types and cooperative agents, varying fusion schemes, and operation under different communication conditions. Therefore, their complex composition gives rise to numerous operational challenges. Furthermore, when cooperative perception systems produce erroneous predictions, the types of errors and their underlying causes remain insufficiently explored. To bridge this gap, we take an initial step by conducting an empirical study of V2X cooperative perception. To systematically evaluate the impact of cooperative perception on the ego vehicle's perception performance, we identify and analyze six prevalent error patterns in cooperative perception systems. We further conduct a systematic evaluation of the critical components of these systems through our large-scale study and identify the following key findings: (1) The LiDAR-based cooperation configuration exhibits the highest perception performance; (2) Vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communication exhibit distinct cooperative perception performance under different fusion schemes; (3) Increased cooperative perception errors may result in a higher frequency of driving violations; (4) Cooperative perception systems are not robust against communication interference when running online. Our results reveal potential risks and vulnerabilities in critical components of cooperative perception systems. We hope that our findings can better promote the design and repair of cooperative perception systems.


Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception

Li, Yang, Yuan, Quan, Luo, Guiyang, Fu, Xiaoyuan, Pan, Rui, Yang, Yujia, Shao, Congzhang, Liu, Yuewen, Li, Jinglin

arXiv.org Artificial Intelligence

Collaborative perception allows agents to enhance their perceptual capabilities by exchanging intermediate features. Existing methods typically organize these intermediate features as 2D bird's-eye-view (BEV) representations, which discard critical fine-grained 3D structural cues essential for accurate object recognition and localization. To this end, we first introduce point-level tokens as intermediate representations for collaborative perception. However, point-cloud data are inherently unordered, massive, and position-sensitive, making it challenging to produce compact and aligned point-level token sequences that preserve detailed structural information. Therefore, we present CoPLOT, a novel Collaborative perception framework that utilizes Point-Level Optimized Tokens. It incorporates a point-native processing pipeline, including token reordering, sequence modeling, and multi-agent spatial alignment. A semantic-aware token reordering module generates adaptive 1D reorderings by leveraging scene-level and token-level semantic information. A frequency-enhanced state space model captures long-range sequence dependencies across both spatial and spectral domains, improving the differentiation between foreground tokens and background clutter. Lastly, a neighbor-to-ego alignment module applies a closed-loop process, combining global agent-level correction with local token-level refinement to mitigate localization noise. Extensive experiments on both simulated and real-world datasets show that CoPLOT outperforms state-of-the-art models, with even lower communication and computation overhead. Code will be available at https://github.com/CheeryLeeyy/CoPLOT.


Generalist Forecasting with Frozen Video Models via Latent Diffusion

Walker, Jacob C, Vélez, Pedro, Cabrera, Luisa Polania, Zhou, Guangyao, Kabra, Rishabh, Doersch, Carl, Ovsjanikov, Maks, Carreira, João, Ginosar, Shiry

arXiv.org Artificial Intelligence

Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model's perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models-including those trained generatively-and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks and apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.


Fast2comm:Collaborative perception combined with prior knowledge

Zhang, Zhengbin, Wu, Yan, Zhang, Hongkun

arXiv.org Artificial Intelligence

Collaborative perception has the potential to significantly enhance perceptual accuracy through the sharing of complementary information among agents. However, real-world collaborative perception faces persistent challenges, particularly in balancing perception performance and bandwidth limitations, as well as coping with localization errors. To address these challenges, we propose Fast2comm, a prior knowledge-based collaborative perception framework. Specifically, (1)we propose a prior-supervised confidence feature generation method, that effectively distinguishes foreground from background by producing highly discriminative confidence features; (2)we propose GT Bounding Box-based spatial prior feature selection strategy to ensure that only the most informative prior-knowledge features are selected and shared, thereby minimizing background noise and optimizing bandwidth efficiency while enhancing adaptability to localization inaccuracies; (3)we decouple the feature fusion strategies between model training and testing phases, enabling dynamic bandwidth adaptation. To comprehensively validate our framework, we conduct extensive experiments on both real-world and simulated datasets. The results demonstrate the superior performance of our model and highlight the necessity of the proposed methods. Our code is available at https://github.com/Zhangzhengbin-TJ/Fast2comm.


Towards Efficient Roadside LiDAR Deployment: A Fast Surrogate Metric Based on Entropy-Guided Visibility

Jiang, Yuze, Javanmardi, Ehsan, Tsukada, Manabu, Esaki, Hiroshi

arXiv.org Artificial Intelligence

The deployment of roadside LiDAR sensors plays a crucial role in the development of Cooperative Intelligent Transport Systems (C-ITS). However, the high cost of LiDAR sensors necessitates efficient placement strategies to maximize detection performance. Traditional roadside LiDAR deployment methods rely on expert insight, making them time-consuming. Automating this process, however, demands extensive computation, as it requires not only visibility evaluation but also assessing detection performance across different LiDAR placements. To address this challenge, we propose a fast surrogate metric, the Entropy-Guided Visibility Score (EGVS), based on information gain to evaluate object detection performance in roadside LiDAR configurations. EGVS leverages Traffic Probabilistic Occupancy Grids (TPOG) to prioritize critical areas and employs entropy-based calculations to quantify the information captured by LiDAR beams. This eliminates the need for direct detection performance evaluation, which typically requires extensive labeling and computational resources. By integrating EGVS into the optimization process, we significantly accelerate the search for optimal LiDAR configurations. Experimental results using the AWSIM simulator demonstrate that EGVS strongly correlates with Average Precision (AP) scores and effectively predicts object detection performance. This approach offers a computationally efficient solution for roadside LiDAR deployment, facilitating scalable smart infrastructure development.


Fresh2comm: Information Freshness Optimized Collaborative Perception

Wu, Ziyong, Peng, Zhilin, Yu, Lei

arXiv.org Artificial Intelligence

Collaborative perception is a cornerstone of intelligent connected vehicles, enabling them to share and integrate sensory data to enhance situational awareness. However, measuring the impact of high transmission delay and inconsistent delay on collaborative perception in real communication scenarios, as well as improving the effectiveness of collaborative perception under such conditions, remain significant challenges in the field. To address these challenges, we incorporate the key factor of information freshness into the collaborative perception mechanism and develop a model that systematically measures and analyzes the impacts of real-world communication on collaborative perception performance. This provides a new perspective for accurately evaluating and optimizing collaborative perception performance. We propose and validate an Age of Information (AoI)-based optimization framework that strategically allocates communication resources to effectively control the system's AoI, thereby significantly enhancing the freshness of information transmission and the accuracy of perception. Additionally, we introduce a novel experimental approach that comprehensively assesses the varying impacts of different types of delay on perception results, offering valuable insights for perception performance optimization under real-world communication scenarios.


Where2comm: Communication-Efficient Collaborative Perception via Spatial Confidence Maps

Neural Information Processing Systems

Multi-agent collaborative perception could significantly upgrade the perception performance by enabling agents to share complementary information with each other through communication. It inevitably results in a fundamental trade-off between perception performance and communication bandwidth. To tackle this bottleneck issue, we propose a spatial confidence map, which reflects the spatial heterogeneity of perceptual information. It empowers agents to only share spatially sparse, yet perceptually critical information, contributing to where to communicate. Based on this novel spatial confidence map, we propose Where2comm, a communication-efficient collaborative perception framework. Where2comm has two distinct advantages: i) it considers pragmatic compression and uses less communication to achieve higher perception performance by focusing on perceptually critical areas; and ii) it can handle varying communication bandwidth by dynamically adjusting spatial areas involved in communication.